$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

Linear Regression with Fixed Nonlinear Features

The models we have been building are linear in the parameters $\wv$ and linear in the attributes (features) of the samples. We can make models that are nonlinear in the attributes by adding nonlinear functions of the original features as new columns of the data matrix.

Say we have a single feature for each sample. Our data matrix is $$ \begin{alignat*}{1} X &= \begin{bmatrix} x_0\\ x_1\\ \vdots \\ x_N \end{bmatrix} \end{alignat*} $$ We can add other powers of each $x$ value, say up to the fourth power. $$ \begin{alignat*}{1} X &= \begin{bmatrix} x_0 & x_0^2 & x_0^3 & x_0^4\\ x_1 & x_1^2 & x_1^3 & x_1^4\\ \vdots \\ x_N & x_N^2 & x_N^3 & x_N^4\\ \end{bmatrix} \end{alignat*} $$

This is simple to do in Python:

X = np.hstack((X, X**2, X**3, X**4))

Which of these powers of $x$ are useful? Looking at the magnitudes of the weights is helpful, as long as the input features are standardized. We can do more than this, though. If we build multiple models from multiple bootstrap samples of the training data, we can compute confidence intervals for the weight values. If zero does not lie between a weight's lower and upper 90% confidence limits, then we can say we are 90% certain that the weight is not zero. If the range does include zero, the corresponding feature is probably not useful. A sketch of this check appears below, after the bootstrap models are built.

Here is some code that illustrates this whole process, including the use of the penalty on weight magnitudes controlled by $\lambda$.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import random

In [3]:
def train(X,T,lamb=0):
    means = X.mean(0)
    stds = X.std(0)
    n,d = X.shape
    # Standardize the inputs and insert a column of ones for the bias weight
    Xs1 = np.insert((X - means)/stds, 0, 1, axis=1)
    # Penalty matrix: lamb on the diagonal, but no penalty on the bias weight
    lambDiag = np.eye(d+1) * lamb
    lambDiag[0,0] = 0
    # rcond=None silences the FutureWarning that recent numpy emits for np.linalg.lstsq
    w = np.linalg.lstsq( np.dot(Xs1.T,Xs1) + lambDiag, np.dot(Xs1.T,T), rcond=None)[0]
    return {'w': w, 'means':means, 'stds':stds, 'lambda': lamb}

def use(model, X):
    # Standardize with the training means and stds, then add the bias column
    Xs1 = np.insert((X-model['means'])/model['stds'], 0, 1, axis=1)
    return Xs1 @ model['w']

def rmse(A,B):
    return np.sqrt(np.mean( (A-B)**2 ))
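
For reference, the linear system that train sets up and solves with np.linalg.lstsq is the set of regularized normal equations for the standardized inputs with a prepended bias column, written here as $\Xv_s$: $$ \begin{alignat*}{1} \left( \Xv_s^T \Xv_s + \lambda \Lambdav \right) \wv &= \Xv_s^T \Tv \end{alignat*} $$ where $\Lambdav$ is the $(d+1)\times(d+1)$ identity matrix with its first diagonal entry set to zero (lambDiag in the code), so the bias weight is not penalized.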

Now, let's make some data from a simple function of $x$. How about $f(x) = -1 + 0.1 x^2 - 0.02 x^3 + 0.5 n$, where $n$ is drawn from a standard Normal distribution?


In [4]:
nSamples = 40
trainingFraction = 0.5
nModels = 1000
confidence = 90 # percent
lambdaw = 0.0 # penalty on weight magnitudes

X = np.hstack((np.linspace(0, 3, num=nSamples),
               np.linspace(6, 10, num=nSamples))).reshape((2*nSamples,1))
# T = -1 + 1 * X + 2 * np.sin(X*2) + 0.55*np.random.normal(size=(2*nSamples,1))
T = -1 + 0 * X + 0.1 * X**2 - 0.02 * X**3 + 0.5*np.random.normal(size=(2*nSamples,1))
X.shape, T.shape


Out[4]:
((80, 1), (80, 1))

In [5]:
plt.plot(X,T,'.-');


Let's add a nonlinear function of each feature. Here we add $\sin(x)$ as a second feature; the commented-out line shows how to add powers of $x$ up to the fifth power instead.


In [6]:
#Xf = np.hstack((X, X**2, X**3, X**4, X**5))
Xf = np.hstack((X, np.sin(X)))

Divide the data into training and testing sets, randomly. The random.sample function draws a given number of items without replacement; here is a quick demonstration, followed by the actual split.


In [7]:
random.sample([1,2,3,4,5],5)


Out[7]:
[1, 2, 4, 3, 5]

In [17]:
nRows = Xf.shape[0]
nTrain = int(round(nRows*trainingFraction))
nTest = nRows - nTrain
allI = range(nRows)
trainI = random.sample(allI,nTrain)
testI = list(set(allI).difference(set(trainI)))
Xtrain = Xf[trainI,:]
Ttrain = T[trainI,:]
Xtest = Xf[testI,:]
Ttest = T[testI,:]

Xtrain.shape, Ttrain.shape, Xtest.shape, Ttest.shape


Out[17]:
((40, 2), (40, 1), (40, 2), (40, 1))

In [18]:
plt.plot(Xtrain[:,0],Ttrain,'o',label='Train')
plt.plot(Xtest[:,0],Ttest,'ro',label='Test')
plt.xlim(-2,12)
plt.legend(loc='best');


Make models from bootstrap samples of the training data, drawn with replacement. The variable models will be a list containing one model per bootstrap sample.


In [21]:
nModels = 100
models = []
for modeli in range(nModels):
    # Draw random sample (row) numbers, with repetition
    trainI = [random.randint(0,nTrain-1) for i in range(nTrain)]
    XtrainBoot = Xtrain[trainI,:]
    TtrainBoot = Ttrain[trainI,:]
    model = train(XtrainBoot,TtrainBoot,0)
    models.append(model)

In [22]:
len(models)


Out[22]:
100

In [23]:
models[0]


Out[23]:
{'lambda': 0,
 'means': array([ 4.60128205,  0.54178569]),
 'stds': array([ 3.24165317,  0.4058448 ]),
 'w': array([[-2.65151332],
        [-1.96277949],
        [ 0.7107336 ]])}
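
The paragraph near the top of the notebook described checking whether zero lies between a weight's lower and upper 90% confidence limits. Here is a minimal sketch of that check, assuming the models list and the confidence value from the cells above; np.percentile computes the limits from the bootstrap weight samples. The names W, lower, upper, and note are just this sketch's choices.

In [ ]:
# Stack the weight vectors from all bootstrap models: shape (nModels, nWeights)
W = np.array([model['w'].flatten() for model in models])
lower = np.percentile(W, (100 - confidence)/2, axis=0)        # 5th percentile of each weight
upper = np.percentile(W, 100 - (100 - confidence)/2, axis=0)  # 95th percentile of each weight
# w[0] is the bias weight; the rest correspond to the standardized x and sin(x) columns
for i in range(W.shape[1]):
    note = 'probably useful' if lower[i] > 0 or upper[i] < 0 else 'interval contains zero'
    print('w[{}]: {:7.3f} to {:7.3f}  ({})'.format(i, lower[i], upper[i], note))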

Now we will apply all of the models to the test data and average their predictions.


In [29]:
YAll = []
for model in models:
    YAll.append( use(model, Xtest) )
YAll = np.array(YAll).squeeze().T   # one column per bootstrap model: shape (nTest, nModels)
Ytest = np.mean(YAll,axis=1)        # average the models' predictions for each test sample
Ttest = Ttest.flatten()  ## WATCH OUT!! Without this, Ytest - Ttest broadcasts (40,) - (40,1) to (40,40) and the next line gives a wrong scalar!
RMSEtest = np.sqrt(np.mean((Ytest - Ttest)**2))
print('Test RMSE is {:.4f}'.format(RMSEtest))


Test RMSE is 1.6646
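
A quick illustration of the broadcasting pitfall flagged by the WATCH OUT comment above, on a small example (the arrays a and b here are just for illustration):

In [ ]:
a = np.arange(3)                 # shape (3,), like Ytest
b = np.arange(3).reshape(3,1)    # shape (3,1), like Ttest before flattening
print((a - b).shape)             # (3, 3): broadcasting gives a matrix, not elementwise differences
print((a - b.flatten()).shape)   # (3,): what we actually want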

In [30]:
nPlot = 100
Xplot = np.linspace(0,13.5,nPlot).reshape((nPlot,1))
Xplotf = np.hstack((Xplot,np.sin(Xplot))) #Xplot**2,Xplot**3,Xplot**4,Xplot**5))
Ys = []
for model in models:
    Yplot = use(model, Xplotf)
    Ys.append( Yplot.flatten())
Ys = np.array(Ys).T

plt.figure(figsize=(10,10))
plt.plot(Xtrain[:,0],Ttrain,'o');
plt.plot(Xtest[:,0],Ttest,'ro');
plt.plot(Xplotf[:,0],Ys,alpha=0.2);
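
The bootstrap predictions plotted above can also be summarized with the same percentile idea used for the weights. A minimal sketch, reusing Ys, Xplotf, Xtrain, Xtest, Ttrain, and Ttest from the cells above; the 5th and 95th percentiles correspond to the 90% confidence level chosen earlier, and the fill_between band is this sketch's addition, not part of the original plot.

In [ ]:
Ymean = Ys.mean(axis=1)                 # mean prediction across bootstrap models
Ylow = np.percentile(Ys, 5, axis=1)     # lower 90% limit at each x
Yhigh = np.percentile(Ys, 95, axis=1)   # upper 90% limit at each x

plt.figure(figsize=(10,10))
plt.plot(Xtrain[:,0], Ttrain, 'o')
plt.plot(Xtest[:,0], Ttest, 'ro')
plt.fill_between(Xplotf[:,0], Ylow, Yhigh, alpha=0.3)
plt.plot(Xplotf[:,0], Ymean);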


Now let's try some other data. Here is some data recorded from EMG sensors, along with one column of resulting joint forces. The data comes from the UCI Machine Learning Repository. There you will see that the data has four channels of EMG and one of force, measured at the knee of 22 subjects.


In [31]:
!wget http://www.cs.colostate.edu/~anderson/cs480/notebooks/emg.txt
!head emg.txt


--2017-02-16 21:16:33--  http://www.cs.colostate.edu/~anderson/cs480/notebooks/emg.txt
Resolving www.cs.colostate.edu (www.cs.colostate.edu)... 129.82.45.114
Connecting to www.cs.colostate.edu (www.cs.colostate.edu)|129.82.45.114|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9279142 (8.8M) [text/plain]
Saving to: ‘emg.txt.1’

emg.txt.1           100%[===================>]   8.85M  1.32MB/s    in 6.1s    

2017-02-16 21:16:39 (1.45 MB/s) - ‘emg.txt.1’ saved [9279142/9279142]

1.500e-03 -8.000e-04 -4.600e-03 7.000e-04 3.000e+00
3.000e-03 4.500e-03 -1.580e-02 -8.000e-04 3.000e+00
3.700e-03 4.500e-03 -1.650e-02 -8.000e-04 3.000e+00
0.000e+00 3.700e-03 -1.730e-02 -1.500e-03 3.000e+00
-8.000e-04 7.000e-04 -1.200e-02 -1.500e-03 3.000e+00
0.000e+00 7.000e-04 -6.800e-03 7.000e-04 3.000e+00
-8.000e-04 7.000e-04 -2.300e-03 7.000e-04 3.000e+00
7.000e-04 -3.000e-03 -1.500e-03 7.000e-04 3.000e+00
-4.600e-03 -6.000e-03 -8.000e-04 0.000e+00 3.000e+00
-6.000e-03 -4.600e-03 1.120e-02 -8.000e-04 3.000e+00

In [32]:
data = np.loadtxt('emg.txt')

T = data[:,-1:]
X = data[:,:-1]
X.shape,T.shape


Out[32]:
((177179, 4), (177179, 1))

In [33]:
for i in range(4):
    plt.subplot(2,2,i+1)
    plt.plot(X[:,i],T,'.')



In [34]:
plt.plot(X[:100,:]);



In [35]:
for i in range(4):
    plt.subplot(2,2,i+1)
    plt.plot(np.abs(X[:,i]),np.abs(T),'.')


Now let's fit a linear model to this data, using the original features and their squares as inputs.


In [36]:
Xf = np.hstack((X,X**2))
model = train(Xf,T)
predict = use(model, Xf)
print(rmse(predict,T))


57.5897204972

In [37]:
X.shape


Out[37]:
(177179, 4)

In [38]:
result = []
# Note: for each deg this loop appends X**1, ..., X**(deg-1) to the original X, so
# deg=2 only adds a duplicate copy of X. That is why degrees 1 and 2 give identical
# errors below, while deg=3 also adds X**2.
for deg in [1,2,3]:
    Xf = X.copy()
    for d in range(1,deg):
        Xf = np.hstack((Xf, X**d))
    err = rmse( use( train(Xf, T), Xf), T)
    result.append([deg,err])
result = np.array(result)
result


Out[38]:
array([[  1.        ,  58.40233916],
       [  2.        ,  58.40233916],
       [  3.        ,  57.5897205 ]])

In [39]:
Xf.shape


Out[39]:
(177179, 12)

In [40]:
plt.plot(result[:,0],result[:,1],'o-');



In [ ]: